Data manipulation

tidyverse, grouping and formulas

Gerko Vink

Methodology & Statistics @ Utrecht University

Statistical programming with R team

Summerschool @ Utrecht University

5 Jun 2025

Disclaimer

I owe a debt of gratitude to many people as the thoughts and code in these slides are the process of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.

Warning

You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:

  • You must ensure that the content is not used for further training of the model

Slide materials and source code

Materials

Recap

Gisteren hebben we deze onderwerpen behandeld:

  • Importeren en bestuderen van datasets
  • Begrijpen en toepassen van verschillende datatypes en database formats
  • Variabelen labelen en (her)coderen
  • De blauwdruk van R: frames en environments
  • Pipes
  • Formules gebruiken in functies

Today

Vandaag leren we:

  • Het combineren van datasets
  • Groeperen en aggregeren
  • Nieuwe variabelen creëren
  • Filteren en sorteren van gegevens
  • Het maken en aanpassen van datagroepen
  • Clustering van gegevens

Make some data

planet <- c("Mercury", "Venus", "Earth", "Mars", 
                "Jupiter", "Saturn", "Uranus", "Neptune")

planet_type <- c("Terrestrial planet", "Terrestrial planet", 
              "Terrestrial planet", "Terrestrial planet", "Gas giant", 
              "Gas giant", "Gas giant", "Gas giant")

diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)

rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)

rings    <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

planets <- data.frame(planet_type = factor(planet_type), 
                      diameter, rotation, rings, 
                      row.names = planet)

Datasets combineren

Joining data

band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

Inner join

With an inner join, we combine two data frames based on a common key. Only the rows with matching keys in both data frames are kept.

band_members %>% inner_join(band_instruments)
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

Left join

With a left join, we keep all rows from the left data frame and only the matching rows from the right data frame. If there is no match, the result will contain NA for the columns from the right data frame.

band_members %>% left_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

Right join

With a right join, we keep all rows from the right data frame and only the matching rows from the left data frame. If there is no match, the result will contain NA for the columns from the left data frame.

band_members %>% right_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

Full join

With a full join, we keep all rows from both data frames. If there is no match, the result will contain NA for the columns from the other data frame.

band_members %>% full_join(band_instruments)
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

The tidyverse packages

tidyverse and the data analysis cycle

Tidyverse and the verbs of data manipulation

Leading principle: language of programming should really behave like a language, tidyverse.


tidyverse: a few key verb that perform common types of data manipulation.

Tidy data

The tidyverse packages operate on tidy data:

  1. Each column is a variable

  2. Each row is an observation

  3. Each cell is a single value


Untidy versus tidy data

The dplyr package

Data manipulation with dplyr

The dplyr package is a specialized package for working with data.frames (and the related tibble) to transform and summarize tabular data:

  • summary statistics for grouped data
  • selecting variables
  • filtering cases
  • (re)arranging cases
  • computing new variables
  • recoding variables

dplyr cheatsheet

Common dplyr functions

There are many functions available in dplyr, but we will focus on just the following dplyr functions (verbs):

dplyr verbs Description
glimpse() a transposed print of the data that shows all variables
select() selects variables (columns) based on their names
filter() subsets the rows of a data frame based on their values
arrange() re-order or arrange rows
mutate() adds new variables, or new variables that are functions of existing variables
summarise() creates a new data frame with statistics of the variables (optional grouped by another variables)
group_by() allows for group operations in the “split-apply-combine” concept

Check the dplyr cheat sheet for examples.

dplyr::glimpse()

  • Prints a transposed version of the data: variables are the rows, observations are the columns.
  • Makes it possible to see every column in a data frame.
  • It is similar to str(), but shows more data.
  • str() shows more detailed information about data structure.
dplyr::glimpse(planets)
Rows: 8
Columns: 4
$ planet_type <fct> Terrestrial planet, Terrestrial planet, Terrestrial planet…
$ diameter    <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883
$ rotation    <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings       <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
str(planets)
'data.frame':   8 obs. of  4 variables:
 $ planet_type: Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
 $ diameter   : num  0.382 0.949 1 0.532 11.209 ...
 $ rotation   : num  58.64 -243.02 1 1.03 0.41 ...
 $ rings      : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...

Compute new variables

Compute new variables with dplyr::mutate()

dplyr::mutate() adds a new variable to the data frame.

data %>% 
  dplyr::mutate(..., .keep = c("all", ...), .before = NULL, .after = NULL)


Arguments:

.keep specifies which variables to return, “all”, “used”, “unused”, “none”.

.before or .after determine where the new variables are inserted.

Compute new variables with dplyr::mutate()

data %>% 
  dplyr::mutate(..., .keep = c("all", ...), .before = NULL, .after = NULL)


Example: compute a new variable rotation_diameter = rotation/diameter, add it to the data frame and keep all other variables:

planets %>% 
  dplyr::mutate(rotation_diameter = rotation/diameter, .keep = "all") %>%
  glimpse()
Rows: 8
Columns: 5
$ planet_type       <fct> Terrestrial planet, Terrestrial planet, Terrestrial …
$ diameter          <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.…
$ rotation          <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings             <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
$ rotation_diameter <dbl> 153.50785340, -256.08008430, 1.00000000, 1.93609023,…

Temporary / permanent changes

The pipe operations do not make changes to the original data set, unless you save the results:

Temporary:

planets %>% 
  dplyr::mutate(rotation_diameter = 
                  rotation/diameter)
names(planets)
[1] "planet_type" "diameter"    "rotation"    "rings"      

Changes saved in new data frame:

new_data_set <- planets %>% 
  dplyr::mutate(rotation_diameter = 
                  rotation/diameter) 
names(new_data_set)
[1] "planet_type"       "diameter"          "rotation"         
[4] "rings"             "rotation_diameter"

Filtering and sorting

Select columns with dplyr::select()

Select variables type and diameter from the planets data frame:

planets %>% 
  select(planet_type, diameter)
               planet_type diameter
Mercury Terrestrial planet    0.382
Venus   Terrestrial planet    0.949
Earth   Terrestrial planet    1.000
Mars    Terrestrial planet    0.532
Jupiter          Gas giant   11.209
Saturn           Gas giant    9.449
Uranus           Gas giant    4.007
Neptune          Gas giant    3.883

Select numeric columns with dplyr::select()

Select numerical variables with where(is.numeric):

planets %>% 
  select(where(is.numeric))
        diameter rotation
Mercury    0.382    58.64
Venus      0.949  -243.02
Earth      1.000     1.00
Mars       0.532     1.03
Jupiter   11.209     0.41
Saturn     9.449     0.43
Uranus     4.007    -0.72
Neptune    3.883     0.67

Select numeric columns with dplyr::select()

Select numerical variables with where(is.numeric):

planets %>% 
  select(where(is.factor))
               planet_type
Mercury Terrestrial planet
Venus   Terrestrial planet
Earth   Terrestrial planet
Mars    Terrestrial planet
Jupiter          Gas giant
Saturn           Gas giant
Uranus           Gas giant
Neptune          Gas giant

Select rows with dplyr::filter()

Selects subsets of the rows of a data frame based on their values.

Select the planets that have a ring and that are gas giants:

planets %>% 
  filter(rings == TRUE, 
         planet_type == "Gas giant")
        planet_type diameter rotation rings
Jupiter   Gas giant   11.209     0.41  TRUE
Saturn    Gas giant    9.449     0.43  TRUE
Uranus    Gas giant    4.007    -0.72  TRUE
Neptune   Gas giant    3.883     0.67  TRUE

Select rows and columns

Select the planets that have a ring and that are gas giants:

planets %>% 
  filter(rings == TRUE, 
         planet_type == "Gas giant")
        planet_type diameter rotation rings
Jupiter   Gas giant   11.209     0.41  TRUE
Saturn    Gas giant    9.449     0.43  TRUE
Uranus    Gas giant    4.007    -0.72  TRUE
Neptune   Gas giant    3.883     0.67  TRUE

Re-order rows with dplyr::arrange()

Order the rows of the planets data set on ascending values of diameter:


Original data set:

planets
               planet_type diameter rotation rings
Mercury Terrestrial planet    0.382    58.64 FALSE
Venus   Terrestrial planet    0.949  -243.02 FALSE
Earth   Terrestrial planet    1.000     1.00 FALSE
Mars    Terrestrial planet    0.532     1.03 FALSE
Jupiter          Gas giant   11.209     0.41  TRUE
Saturn           Gas giant    9.449     0.43  TRUE
Uranus           Gas giant    4.007    -0.72  TRUE
Neptune          Gas giant    3.883     0.67  TRUE

Ordered data set, based on diameter:

planets %>% 
  dplyr::arrange(diameter)
               planet_type diameter rotation rings
Mercury Terrestrial planet    0.382    58.64 FALSE
Mars    Terrestrial planet    0.532     1.03 FALSE
Venus   Terrestrial planet    0.949  -243.02 FALSE
Earth   Terrestrial planet    1.000     1.00 FALSE
Neptune          Gas giant    3.883     0.67  TRUE
Uranus           Gas giant    4.007    -0.72  TRUE
Saturn           Gas giant    9.449     0.43  TRUE
Jupiter          Gas giant   11.209     0.41  TRUE

Multiple transformations: base R and dplyr

Suppose we want to perform the following transformations:

  1. Sort the rows of planets on ascending values of rotation
  2. Select only planets with diameter > 1
  3. Display the variables planet_type, diameter and rotation

With base R code:

subset(planets[order(planets$rotation), ],  
       subset = diameter > 1, 
       select = c(planet_type, diameter, 
                  rotation))
        planet_type diameter rotation
Uranus    Gas giant    4.007    -0.72
Jupiter   Gas giant   11.209     0.41
Saturn    Gas giant    9.449     0.43
Neptune   Gas giant    3.883     0.67

With dplyr and the pipe %>% operator

planets %>% 
  dplyr::filter(diameter > 1) %>% 
  dplyr::arrange(rotation) %>% 
  dplyr::select(planet_type, diameter, rotation)
        planet_type diameter rotation
Uranus    Gas giant    4.007    -0.72
Jupiter   Gas giant   11.209     0.41
Saturn    Gas giant    9.449     0.43
Neptune   Gas giant    3.883     0.67

Summary statistics with summarise()

The dplyr function for summarizing data:

planets %>% 
  dplyr::summarise(
    mean_diameter = mean(diameter), 
    sd_diameter = sd(diameter)
  )
  mean_diameter sd_diameter
1      3.926375    4.226738
  • Various summary function(s):
    • mean(), median(), sd(), var(), sum(), for numeric variables
    • n(), n_distinct() for counts
    • many others, see: ?dplyr::select and cheat sheet)

Summaries for groups with group_by()

The dplyr function for grouping rows of a data frame is very useful in combination with summarise()

Example: group the planets based on having rings (or not) and compute the mean and the standard deviation for each group.

planets %>% 
  dplyr::group_by(rings) %>%
  dplyr::summarise(
    mean_diameter = mean(diameter), 
    sd_diameter = sd(diameter)
  )
# A tibble: 2 × 3
  rings mean_diameter sd_diameter
  <lgl>         <dbl>       <dbl>
1 FALSE         0.716       0.306
2 TRUE          7.14        3.76 

Standard solves for missing values

Dealing with missing values in R

Calculations based on missing values (NA’s) are not possible in R:

variable <- c(1, 2, NA, 4, 5)
mean(variable)
[1] NA

There are two easy ways to perform “listwise deletion”:

mean(variable, na.rm = TRUE)
[1] 3
mean(na.omit(variable))
[1] 3

Dealing with missing values with dplyr

df$score
[1]  1  2 NA  4  5

No solution for missing values:

df %>% 
  dplyr::summarise(
    mean_variable = mean(score), 
    sd_variable = sd(score)
  )
  mean_variable sd_variable
1            NA          NA

Use na.rm = TRUE:

df %>% 
  dplyr::summarise(
    mean_variable = mean(score, na.rm = TRUE), 
    sd_variable = sd(score, na.rm = TRUE)
  )
  mean_variable sd_variable
1             3    1.825742

Style guide for coding pipes

Code with a single pipe operator on one line and spaces around %>%:

data %>% dplyr::select(X)

Code with multiple pipe operators on multiple lines:

data %>% 
  dplyr::group_by(X) %>% 
  dplyr::filter(Y > 4) %>% 
  dplyr::summarise(mean(Y))

but definitely NOT:

data%>%dplyr::group_by(X)%>%dplyr::filter(Y>4)%>%dplyr::summarise(mean(Y))

More about coding style: tidyverse style guide

https://style.tidyverse.org/index.html

Practical